GOALS

♦  To support multiple classes of military missions with a single morphable architecture

♦  To eliminate processing system redundancies through rapid dynamic reconfiguration of front-end filtering and data-reduction processing

♦  To reduce application development costs by allowing the hardware to be mapped to the algorithms both statically and dynamically

♦  To develop an architecture that can quickly and efficiently adapt to changing situations, both internal (fault tolerance, sensor configurations) and external (threat changes, mission phasing, environment)

Key Ideas

Combines fine, medium and coarse grain processing resources on a single chip

Matches hardware to the algorithms and the control flow mechanisms

Configures memory structures for efficient front-end and back-end processing

Provides flexible gigabyte I/O channels for direct interface to sensors and inter-chip communication

Supports  all  systems  processing  requirements  with  a  single 
MONARCH  chip  type 

Approach

♦  Leverage DARPA-sponsored DIVA Project results, Raytheon IRAD-sponsored HPPS and Mercury Stream Co-processing Engine

♦  Use DoD missions to drive micro-architecture and morphing concepts and implementation

♦  Determine the “sweet spot” for mixing large, small-to-medium and fine-grained elements

♦  Through experiments and simulations demonstrate a “single chip” VLSI processing architecture based on DIVA and HPPS

DIVA Leverage: The Chip

Exploiting The Bandwidth in a System

DIVA Solutions:

•  Move concurrent processing on-chip
•  More bandwidth and less latency on chip
•  Added bandwidth between memories
•  Lower latencies throughout system

DIVA Software/Hardware

Application

Host Runtime Layer
-  Synchronization, Flushing, Thread Management, Host Parcels

Host-PIM Mapping, Parcels, Coherence

PIM Backend Compiler
-  Code generation for scalar and WideWord

PIM Runtime Kernel
-  Parcel Management, PIM Context Switches

Memory Controller

PIM VLSI Devices
-  Processor, Memory Array

DIVA Node Architecture

WideWord ALU Data Flow

KISS: More compromises in architecture to enable early prototype

DIVA PIM Chip

Purpose
-  Demonstrate bandwidth advantages of PIM technology

Key architectural components
-  High memory bandwidth
-  256-bit WideWord processing
-  PIM routing component

Chip statistics
-  9.8 mm x 9.8 mm in TSMC 0.18 µm
-  ~200K logic cells plus 8 Mbit SRAM
-  352 pins (241 signal pins)

HPPS & FPCA ARCHITECTURES

High Performance Processor System

♦  Multinode Processor
♦  One custom ASIC
♦  Innovative voting
♦  Inputs for high bandwidth A/D receiver channels or FPCA

HPPS Node Architecture

Fault-Tolerant Intf. Core
-  I/O Capability
-  Intra-node
-  Inter-node
-  Distrib. Crossbar
-  Fault Detection
-  Memory EDAC
-  Processor voting
-  Node configuration controls

MONARCH: Node Architecture

Note: Not to Scale

Node Control
•  Self test
•  Initialization
•  Fault Mgmt
•  Reconfiguration

Virtual WideWord Unit/Data Flow Unit

Fault-Tolerant Intf. Core
-  I/O Capability
-  Distrib. Crossbar
-  Memory EDAC
-  Processor voting
-  Node configuration control

MONARCH ARCHITECTURE

MONARCH Processor System

♦  Multinode Processor
♦  One MONARCH chip
♦  Innovative voting
♦  Inputs for high bandwidth A/D receiver channels or direct chip-to-chip data transfer

MONARCH Single Chip Architecture

♦  800 MHz Clock
♦  512 ops/clock
♦  12 GFLOPS
♦  400 GOPS
♦  32 MBytes DRAM
♦  320 KBytes SRAM
♦  256 MALU
♦  36 Watts
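
The single-chip figures can be cross-checked with simple arithmetic. The sketch below is not from the original slides: the variable names are illustrative, and the "per clock" and "per watt" values are derived here rather than quoted. It only shows that 800 MHz times 512 ops/clock lands near the stated 400 GOPS.

```python
# Back-of-the-envelope check of the single-chip figures quoted on the slide.
# Names are illustrative assumptions; only the four inputs come from the slide.
CLOCK_HZ = 800e6        # 800 MHz clock
OPS_PER_CLOCK = 512     # 512 ops/clock
QUOTED_GFLOPS = 12      # 12 GFLOPS
POWER_WATTS = 36        # 36 Watts

# Peak integer throughput: clock rate times operations issued per clock.
peak_gops = CLOCK_HZ * OPS_PER_CLOCK / 1e9

# Derived values (not stated in the slides).
flops_per_clock = QUOTED_GFLOPS * 1e9 / CLOCK_HZ
gops_per_watt = peak_gops / POWER_WATTS

print(f"peak throughput: {peak_gops:.1f} GOPS")
print(f"implied floating-point ops per clock: {flops_per_clock:.1f}")
print(f"implied efficiency: {gops_per_watt:.1f} GOPS/W")
```

The computed peak of 409.6 GOPS is consistent with the slide's rounded 400 GOPS.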

MONARCH Application Processor

MONARCH Architecture Features

♦  Dual native mode, high throughput computing
   -  Multiple wide word threaded (instruction flow) processors/chip
   -  Highly parallel reconfigurable (data flow) processor
♦  Large on chip, multiport memories
   -  High bandwidth access to memory
   -  Extensible with off chip memory
♦  High speed, distributed cross bar I/O
   -  Integrated with chip processing
   -  Scalable I/O bandwidth - multiple topologies
   -  Direct connect to high speed I/O devices, e.g., A/Ds
♦  Rich on chip interconnect
   -  Supports on chip topology morphing and fault tolerance
   -  Supports multiple computation models (SISD, SIMD, DF, SPMD,...)
♦  On chip Morph - Program bus and microcontrollers

Architecture Merger* Features

-  Mostly a complementary match and enhancement  -

ISSUE                                      | APPROACH                                        | BENEFIT
-------------------------------------------+-------------------------------------------------+---------------------------------------------
256 bit wide word processing unit          | Each Arithmetic Cluster has eight 32-bit units  | 1 AC provides same width as WW unit
Instruction Set Mapping                    | Basic functions same; need to add some insts    | Little impact
Large On-chip memory                       | Similar to Edge Memory; now can have on chip    | Performance boost
5-stage pipeline, instruction flow decoder | Retain, and mux decoded signals with DF signals | Some hardware growth, but more control modes
Data flow control mode - streaming         | Retain - switch mode bit                        | As above
High speed, multiple channel I/O           | Incorporate dist. xbar and use for parcel com   | Improved I/O performance
Parcel communications                      | Retain and map onto other physical protocol     | Little impact
On-chip micro controllers                  | Retain                                          | Performance boost

*  Merger of features from DIVA and HPPS processors

Architecture Merger Issues

ISSUE                                  | APPROACH                                | IMPACT
---------------------------------------+-----------------------------------------+-----------------------
WideWord 8-bit math                    | Modify array carry-chain logic          | Negligible delay
Thread control for array / WideWord    | Switch RISC pipeline control into array | TBD
3-port WW register file implementation | Extend array arithmetic clusters        | Small area increase
WideWord pipeline length / bypass      | TBD / Simulation                        | Interconnect, Compiler
Minimum I-Cache size                   | Simulation                              | Area
Data exchange: W->S / S->W             | TBD                                     | Area, Interconnect
I/O: Memory map or program?            | Memory Map                              | None
WideWord shifter implementation        | TBD (modify array)                      | Design complexity
Permute implementation                 | Enhance array x-bars for 8 bit data     | Small area increase

FPCA Changes for WideWord

Inst Decode
♦  OP
♦  Data valid (default)
♦  Consume (default)
♦  Token (participation)

Multiplexor for dual mode instruction set control of elements

8 32-bit ALUs become 32 8-bit elements
32 word register file added

“Breakable” carry chains
(Actually still 32 bit processing elements, but condition codes and carries controllable at 8 bit boundaries)
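
The idea of a carry chain that can be broken at 8-bit boundaries can be illustrated in software. The following is a minimal sketch, assuming one 32-bit word that holds four 8-bit lanes; the function names `add32` and `add8x4` and the masking technique are illustrative assumptions and do not describe the actual FPCA logic.

```python
MASK32 = 0xFFFFFFFF
LOW7 = 0x7F7F7F7F    # low 7 bits of every 8-bit lane
HIGH1 = 0x80808080   # top bit of every 8-bit lane


def add32(a: int, b: int) -> int:
    """Ordinary 32-bit add: carries ripple across all byte boundaries."""
    return (a + b) & MASK32


def add8x4(a: int, b: int) -> int:
    """Four independent 8-bit adds packed into one 32-bit word.

    Adding only the low 7 bits of each lane cannot carry out of that lane.
    The top bit of each lane is then combined with XOR (an add without
    carry-out), which models a carry chain broken at each 8-bit boundary.
    """
    low = (a & LOW7) + (b & LOW7)
    return (low ^ ((a ^ b) & HIGH1)) & MASK32


if __name__ == "__main__":
    a, b = 0x01FF01FF, 0x00010001
    print(f"carry chain intact: {add32(a, b):#010x}")
    print(f"carry chain broken: {add8x4(a, b):#010x}")
```

With the chain intact, the overflow from each `0xFF + 0x01` lane spills into its neighbor; with the chain broken, each lane wraps around on its own.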

MONARCH I/O Summary

Interface        | Number of ports | Wires per port | Total wires | Type     | Clock rate
-----------------+-----------------+----------------+-------------+----------+-----------
High speed ports | 12              | 50             | 600         | LVDS     | 1 GHz
Inter FPCA Links | 4               | 52             | 208         | LVDS     | 1-2 GHz
External memory  | 1               | 160            | 160         | CMOS     | 500 MHz
Standard I/O     | 2               | 60             | 120         | variable | 100+ MHz
Total            |                 |                | 1088        |          |
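
The wire totals in the I/O summary follow directly from ports times wires per port. A short sketch of that arithmetic, with dictionary keys chosen here for illustration:

```python
# (number of ports, wires per port) as listed in the I/O summary.
io_ports = {
    "High speed ports": (12, 50),
    "Inter FPCA Links": (4, 52),
    "External memory": (1, 160),
    "Standard I/O": (2, 60),
}

# Total wires per interface class, then the chip-level total.
wires = {name: ports * per_port for name, (ports, per_port) in io_ports.items()}
total_wires = sum(wires.values())

for name, count in wires.items():
    print(f"{name:17s} {count:5d}")
print(f"{'Total':17s} {total_wires:5d}")
```

The per-class products (600, 208, 160, 120) sum to the 1088 wires quoted in the summary.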

Need to Select Preferred Parameters for 1st MONARCH Chip

[Figure: plot with an axis labeled Memory (Mbytes)]

MONARCH Processing Card

-  6Ux160 double euro card form factor  -

[Figure: card diagram with MONARCH chips and Backplane Interconnect]

♦  6 MONARCH chips + memory and power conditioning
♦  75 GFLOPS
♦  2.4 TOPS
♦  192 MBytes on-chip DRAM
♦  1 GByte on-board memory
♦  2 MBytes on-chip SRAM
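
The card-level numbers can be compared against six times the single-chip figures given earlier in the deck (12 GFLOPS, 400 GOPS, 32 MBytes DRAM, 320 KBytes SRAM). This is a rough sketch with illustrative names. The slides do not say how the card figures were rounded, so small differences are left unexplained.

```python
CHIPS_PER_CARD = 6

# Single-chip figures from the "MONARCH Single Chip Architecture" slide.
chip = {"gflops": 12, "gops": 400, "dram_mbytes": 32, "sram_kbytes": 320}

# Simple linear scaling to one card; ignores any inter-chip overheads.
card = {key: value * CHIPS_PER_CARD for key, value in chip.items()}

print(f"GFLOPS:       {card['gflops']} (slide quotes 75)")
print(f"TOPS:         {card['gops'] / 1000} (slide quotes 2.4)")
print(f"on-chip DRAM: {card['dram_mbytes']} MBytes (slide quotes 192)")
print(f"on-chip SRAM: {card['sram_kbytes']} KBytes (slide quotes 2 MBytes)")
```

Linear scaling matches the quoted TOPS and DRAM figures exactly and comes within a few percent of the quoted GFLOPS and SRAM figures.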

Summary & Conclusions

♦  MONARCH features are very attractive for multiple applications
♦  Merger of two existing architectures shows good fit
   -  “Complementary” but compatible features
   -  Rich experience base allows quick design trades
♦  “The devil is in the details”: a lot more work remains
   -  On-chip DRAM organization and access
   -  Support for “morphing”
   -  Simulation results at application-level
   -  Trade offs for FPU capability